Abstract

Empirical divergence maximization (EDM) refers to a recently proposed strategy for
estimating $f$-divergences and likelihood ratio functions. This paper extends the idea to
empirical vector quantization, where one seeks to empirically derive quantization rules that
maximize the Kullback-Leibler divergence between two statistical hypotheses. We analyze
the estimator's error convergence rate leveraging Tsybakov's margin condition and show
that rates as fast as $n^{-1}$ are possible, where $n$ equals the number of training samples. We
also show that the Flynn and Gray algorithm can be used to efficiently compute EDM
estimates and that these estimates can be efficiently and accurately represented by recursive
dyadic partitions. The EDM formulation has several advantages. First, the formulation
gives access to the tools and results of empirical process theory that quantify the estimator's
error convergence rate. Second, the formulation provides a previously unknown derivation
for the Flynn and Gray algorithm. Third, the flexibility it affords allows one to avoid a
small-cell assumption common in other approaches. Finally, we illustrate the potential use
of the method through an example.

1 Introduction 

In statistical learning theory, empirical risk minimization is a standard technique whereby
classifiers are formed from empirical data [1]. The idea is simple enough: when the underlying
probability distributions characterizing the data are unknown, classifiers are found by
minimizing an empirical form of the risk (probability of error) over some specified class of classifiers.
The technique is well understood and has been generalized to include various cost criteria and
problem settings. In its generalized form, empirical risk minimization is sometimes referred to
as M-estimation (the M standing for minimization or maximization) [2].

Recently, Nguyen, Wainwright and Jordan [3] applied M-estimation to the estimation of
$f$-divergences (the Kullback-Leibler (KL) divergence [1] in particular) and to bounded likelihood
ratio functions. In this paper, we build on their ideas and develop a method for computing
empirical quantization rules by maximizing the KL divergence. We call the method empirical
divergence maximization (EDM) in deference to its similarity to empirical risk minimization
and because the name is simple and descriptive. The proposed formulation leads to an entirely
different algorithm for computing the estimators than that employed in [3], and the convergence
rates reported here incorporate a margin condition not included in [3] that shows when fast
convergence is possible.

As the name suggests, the criterion used in EDM is the KL divergence, a well-known
information theoretic quantity that has enjoyed a prominent and long-standing place in both
theory and practice. Applications are numerous and range from detection and estimation
problems [5-7] to texture retrieval in image databases [8] and from the study of neural coding [9]
to linguistic problems [10]. Roughly speaking, the KL divergence quantifies the dissimilarity
of two probability density functions (pdfs) and is therefore often regarded as a "distance",
although it is not a distance metric. Stein's Lemma [11, p. 309] fundamentally links the
divergence to hypothesis testing by relating it to the decay rate of different error probabilities.
In fact, the divergence equals the optimal asymptotic error decay rate of a Neyman-Pearson
test. Thus increasing the divergence between two statistical hypotheses generally increases their
discriminability.

The use of the KL divergence in quantization problems dates back nearly four decades [5].
In that time, various problem settings have been investigated including scalar, vector, and
distributed quantization [5], [12], [13]. Until recently, however, most results addressing this type
of quantization assumed full knowledge of the probability distributions of interest and did not
explicitly address empirical designs. Moreover, those works in the quantization literature most
closely related to the present paper [14], [15] invoke a small-cell assumption that forces partitions,
designed to maximize the divergence, to resemble nearest neighbor partitions even when such
partitions cannot well-approximate theoretically optimal partitions (see Fig. 3). Because of its
flexibility, however, the EDM approach overcomes this shortcoming.

In [15] , Lazebnik and Raginsky study a conceptually similar quantization problem to the one 
considered here, but the differences between the approaches are substantial. For example, their 
information loss criterion is a difference of mutual informations, and while related to the KL 
divergence, this criterion measures a different quantity than the divergence loss studied here. 
Their work is also placed in a machine learning setting where the data and the quantization 
values (labels) are jointly distributed and both play integral roles in their information criterion. 
In this paper, the quantization values play a secondary role in the computation of our estimator. 

To formalize the problem, let $P$ and $Q$ be two probability measures defined on the
probability space $([0,1]^d, \mathcal{B})$, where $\mathcal{B}$ denotes the usual Borel $\sigma$-algebra and $d \geq 1$. Let $p$ and
$q$ denote the density functions of $P$ and $Q$ with respect to Lebesgue measure and assume
$P$ and $Q$ are absolutely continuous with respect to one another. Then any quantization
rule $\gamma : \mathbb{R}^d \to \{0, \dots, L-1\}$ that operates on a random vector $X$ (distributed according
to $P$ or $Q$) induces the probability mass functions (pmfs) $p(\gamma) = (p_0(\gamma), \dots, p_{L-1}(\gamma))$ and
$q(\gamma) = (q_0(\gamma), \dots, q_{L-1}(\gamma))$, where $p_i(\gamma) = P(\gamma(X) = i)$ and similarly for $q_i(\gamma)$. In this
context, the KL divergence is defined as

$$D_{KL}(p(\gamma)\|q(\gamma)) = \sum_{i=0}^{L-1} p_i(\gamma) \log \frac{p_i(\gamma)}{q_i(\gamma)}.$$



In EDM, we maximize an empirical form of the KL divergence over some given class of
quantization rules. We therefore analyze an estimator of the form

$$\gamma_n = \arg\max_{\gamma \in \Gamma} D_n(\gamma),$$

where $D_n(\gamma)$ represents an empirical KL divergence that is defined in Section 2 (the subscript
$n$ signifies that it is an empirical quantity that is based on $n$ samples from both $P$ and $Q$)
and where $\Gamma$ denotes some class of quantization rules. By design, EDM estimators construct
rules $\gamma_n$ that induce maximally divergent pmfs, thereby best preserving the discriminability
of $P$ and $Q$. In other words, EDM estimators maximize the performance (in terms of KL
divergence) of any downstream detector or classifier that operates on the quantized data. The
EDM formulation has several advantages: (i) it readily permits the application of empirical
process theory which in turn provides the tools to quantify the estimator's error decay rates; (ii)
it naturally leads to the Flynn and Gray algorithm which efficiently computes the quantization
rules; (iii) it provides a systematic derivation for the Flynn and Gray algorithm; and (iv) the
flexibility in candidate function classes allows efficient representation of the quantization rules
and overcomes the small-cell constraint.
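
As a concrete illustration of the quantities introduced above, the following minimal sketch computes the pmf induced by a quantization rule on a set of samples and the KL divergence between two induced pmfs. The function names `induced_pmf` and `quantized_kl` and the two-level rule are illustrative assumptions, not part of the paper.

```python
import math

def induced_pmf(samples, rule, levels):
    """Empirical pmf of rule(X) over the samples; rule maps x to {0, ..., levels-1}."""
    counts = [0] * levels
    for x in samples:
        counts[rule(x)] += 1
    return [c / len(samples) for c in counts]

def quantized_kl(p, q):
    """KL divergence sum_i p_i log(p_i / q_i) between two pmfs of equal length."""
    if len(p) != len(q):
        raise ValueError("pmfs must have the same number of levels")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

# Hypothetical two-level rule on [0, 1]: threshold at 0.5.
rule = lambda x: 0 if x < 0.5 else 1
pmf_p = induced_pmf([0.1, 0.2, 0.7, 0.9], rule, 2)
pmf_q = induced_pmf([0.6, 0.7, 0.8, 0.3], rule, 2)
divergence = quantized_kl(pmf_p, pmf_q)
```

A rule that separates the two sample sets better yields pmfs that are further apart and hence a larger divergence.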



2 Empirical Divergence Maximization 

The form of $D_n(\gamma)$ is taken from recent work by Nguyen et al. [3] and relies on rewriting the
convex function $-\log(\cdot)$ appearing in the definition of the KL divergence. Throughout the
paper, we hold the number of quantization levels $L$ fixed.



2.1 Expressing divergence using convex conjugates 

The notion of a convex conjugate is based on the observation that a curve can either be described
by its graph or by an envelope of tangents [16]. More concretely, a (closed) convex function
$f : \mathbb{R} \to \mathbb{R}$ can be described as the pointwise supremum of a collection of affine functions
$h(t) = t t^* - \mu^*$ such that the set of all pairs $(t^*, \mu^*)$ lies within the epigraph of its convex
conjugate $f^*(t^*)$, i.e.,

$$f(t) = \sup_{t^*} \{ t^* t - f^*(t^*) \}, \tag{1}$$

where by duality the convex conjugate $f^*(t^*)$ of $f(t)$ is defined by

$$f^*(t^*) = \sup_{t} \{ t t^* - f(t) \}.$$

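Now suppose $\gamma$ is an arbitrary quantization rule defined on $[0,1]^d$,

$$\gamma(x) = \sum_{i=0}^{L-1} i \, \mathbf{1}_{R_i}(x), \quad x \in [0,1]^d, \tag{2}$$

where $\{R_i\}_{i=0}^{L-1}$ is a collection of disjoint sets partitioning $[0,1]^d$ and $\mathbf{1}_{R_i}(\cdot)$ denotes the indicator
function. Using (1), we can write the divergence between the pmfs induced by $\gamma$ as

$$D_{KL}(p(\gamma)\|q(\gamma)) = \sum_{i=0}^{L-1} p_i(\gamma) \, f\!\left(\frac{q_i(\gamma)}{p_i(\gamma)}\right) \tag{3a}$$

$$= \sum_{i=0}^{L-1} p_i(\gamma) \cdot \sup_{t^*} \left\{ t^* \frac{q_i(\gamma)}{p_i(\gamma)} - f^*(t^*) \right\}, \tag{3b}$$

where $f(t) = -\log(t)$ for $t > 0$ and $+\infty$ otherwise. Calculating the convex conjugate, one finds

$$f^*(t^*) = \begin{cases} -1 - \log(-t^*) & \text{if } t^* < 0 \\ +\infty & \text{if } t^* \geq 0. \end{cases}$$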
Substituting this expression into (3b), we have the following expressions for the KL divergence:

$$D_{KL}(p(\gamma)\|q(\gamma)) = \sum_{i=0}^{L-1} p_i(\gamma) \sup_{t^* < 0} \left\{ t^* \frac{q_i(\gamma)}{p_i(\gamma)} + 1 + \log(-t^*) \right\}$$

$$= 1 + \sum_{i=0}^{L-1} \sup_{c_{R_i} > 0} \left\{ p_i(\gamma) \log(c_{R_i}) - c_{R_i} \, q_i(\gamma) \right\}$$

$$= 1 + \sum_{i=0}^{L-1} \sup_{c_{R_i} > 0} \left\{ P(R_i) \log(c_{R_i}) - c_{R_i} \, Q(R_i) \right\},$$

where in the second step we let $c_{R_i} = -t^*$, and in the last step use the fact that $p_i(\gamma) =
P(R_i)$, $R_i = \{x : \gamma(x) = i\}$. The validity of the last expression is easily verified by differentiating it
with respect to $c_{R_i}$ and solving for the maximizers, which gives $c_{R_i} = P(R_i)/Q(R_i)$. By defining the piecewise constant function

$$\phi(x) := \sum_{i=0}^{L-1} c_{R_i} \mathbf{1}_{R_i}(x), \quad c_{R_i} \in \mathbb{R}_+, \; x \in [0,1]^d, \tag{4}$$

we can write $D_{KL}(p(\gamma)\|q(\gamma))$ in integral form:

$$1 + \sup_{\phi} \left\{ \int_{[0,1]^d} \log(\phi) \, dP - \int_{[0,1]^d} \phi \, dQ \right\}, \tag{5}$$

where the supremum is taken over all functions of the form (4). Note that the $\phi$ which achieves
the supremum depends on $P$, $Q$, and $\{R_i\}_{i=0}^{L-1}$. Below, we restrict $\phi$ to lie within a (more) specific
class of rules and define the proposed quantization rule estimator in terms of the empirical
counterpart to (5).
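
The variational form can be checked numerically on a small example. The sketch below, with hypothetical cell masses `P` and `Q`, evaluates $1 + \sum_i [P(R_i)\log c_{R_i} - c_{R_i} Q(R_i)]$ at the levels $c_{R_i} = P(R_i)/Q(R_i)$ and at a perturbed set of levels; the first should equal the KL divergence and the second should be smaller.

```python
import math

def variational_objective(P, Q, c):
    """1 + sum_i [P_i log c_i - c_i Q_i] for cell masses P, Q and levels c > 0."""
    return 1.0 + sum(p * math.log(ci) - ci * q for p, q, ci in zip(P, Q, c))

def kl(P, Q):
    """KL divergence between two pmfs with strictly positive Q."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Hypothetical masses of three cells R_0, R_1, R_2 under P and Q.
P = [0.1, 0.3, 0.6]
Q = [0.3, 0.4, 0.3]

c_opt = [p / q for p, q in zip(P, Q)]            # maximizing levels
best = variational_objective(P, Q, c_opt)         # equals kl(P, Q)
perturbed = variational_objective(P, Q, [1.1 * ci for ci in c_opt])
```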

In addition, note that unlike $\gamma$, $\phi$ does not map $[0,1]^d$ to a set of indices. We nevertheless
refer to both as quantization rules since $\phi$ only assumes $L$ real values. Note also that in terms of
KL divergence, $\phi$ determines $\gamma$, i.e., if $\phi$ is known, a quantization rule $\gamma : [0,1]^d \to \{0, \dots, L-1\}$
can be defined that induces the same pmfs as $\phi$. This fact becomes important for the algorithm
described in Section 3.



2.2 Empirical estimator 

To define a function class for $\phi$, we first consider different "labelings" of a uniform partition of
$[0,1]^d$. For a given positive integer $J$, let $\pi_J$ denote a tessellation of $[0,1]^d$ by uniform hypercubes
$S_k$, $k = 0, \dots, 2^{dJ} - 1$. To each cell $S_k$, we can associate one of $L$ labels $\{0, \dots, L-1\}$, and
thus for each different labeling of $\pi_J$, we can define another partition, $\pi_R$, with cells
described by

$$R_i = \bigcup_{k : \, \mathrm{label}(S_k) = i} S_k, \quad i = 0, \dots, L-1. \tag{6}$$

Now, for a given partition $\pi_R$ and positive constants $m > 0$ and $M < \infty$, denote by
$\Phi_{\pi_R}(L, J, m, M)$ the set of all $L$-level piecewise constant functions defined on $\pi_R$ that are
bounded and positive:

$$\Phi_{\pi_R}(L, J, m, M) = \left\{ \phi(x) = \sum_{i=0}^{L-1} c_{R_i} \mathbf{1}_{R_i}(x) : m \leq c_{R_i} \leq M \right\}.$$
Letting $\Pi_R$ denote the set of all partitions $\pi_R$, or equivalently the set of all different labelings
of $\pi_J$, we define the candidate class $\Phi(L, J, m, M)$ of our empirical quantizers as

$$\Phi(L, J, m, M) := \bigcup_{\pi_R \in \Pi_R} \Phi_{\pi_R}(L, J, m, M). \tag{7}$$

Letting $\{X_i^p\}_{i=1}^n$ and $\{X_i^q\}_{i=1}^n$ be training data distributed according to $p$ and $q$, respectively,
we define the function $D_n(\phi)$ as an empirical counterpart to (5),

$$D_n(\phi) := 1 + \frac{1}{n} \sum_{i=1}^{n} \log \phi(X_i^p) - \frac{1}{n} \sum_{i=1}^{n} \phi(X_i^q), \tag{8}$$

and define the proposed empirical quantization rule estimator as

$$\phi_n := \arg\max_{\phi \in \Phi(L, J, m, M)} D_n(\phi). \tag{9}$$

$\phi_n$ is our empirical divergence maximization (EDM) estimator. Note $D_n(\phi)$ is not in general a
KL divergence; it can in fact be negative for some $\phi \in \Phi$. The estimator is consistent, however,
converging to the "best in class" estimator as $n \to \infty$ [3].
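
To make the empirical criterion concrete, here is a minimal sketch of $D_n(\phi) = 1 + \frac{1}{n}\sum_i \log\phi(X_i^p) - \frac{1}{n}\sum_i \phi(X_i^q)$ for scalar data. The function name `empirical_divergence`, the sample values, and the two-level rule are illustrative assumptions. The second evaluation shows that a poorly chosen $\phi$ can give a negative value, as noted above.

```python
import math

def empirical_divergence(phi, xs_p, xs_q):
    """D_n(phi) = 1 + mean(log phi(X^p)) - mean(phi(X^q)) for a positive function phi."""
    log_term = sum(math.log(phi(x)) for x in xs_p) / len(xs_p)
    lin_term = sum(phi(x) for x in xs_q) / len(xs_q)
    return 1.0 + log_term - lin_term

# Hypothetical training samples on [0, 1].
xs_p = [0.1, 0.2, 0.3, 0.7]
xs_q = [0.2, 0.6, 0.8, 0.9]

# A two-level piecewise constant rule: level 2.0 on [0, 0.5), level 0.5 on [0.5, 1].
two_level = lambda x: 2.0 if x < 0.5 else 0.5

d_good = empirical_divergence(two_level, xs_p, xs_q)
d_bad = empirical_divergence(lambda x: 3.0, xs_p, xs_q)  # constant rule, negative value
```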



2.3 Best in class and optimal quantization rules 

The best in class estimate $\phi^*$ is that element in $\Phi$ that maximizes $D(\phi)$,

$$\phi^* := \arg\max_{\phi \in \Phi} D(\phi), \quad \text{where} \quad D(\phi) := 1 + \int \log(\phi) \, dP - \int \phi \, dQ. \tag{10}$$

Note that $D(\phi)$, as opposed to $D_n(\phi)$, is not an empirical quantity; its definition requires full
knowledge of the distributions $P$ and $Q$.

We take the theoretically optimal quantization rule $\psi^*$ to be the rule that maximizes the
divergence over a class of piecewise constant functions that has an assumed boundary regularity
(the regularity conditions play a role in the convergence analysis in Section 4.2). The class
definition uses the notion of a locally constant function: a function $f : [0,1]^d \to \mathbb{R}$ is locally
constant at a point $x \in [0,1]^d$ if there exists $\epsilon > 0$ such that for all $y \in [0,1]^d$, the condition
$\|x - y\| < \epsilon$ implies $f(y) = f(x)$.

Definition (PC class [17]). A function $f : [0,1]^d \to \{c_i\}_{i=0}^{L-1}$, $c_i \in \mathbb{R}_+$, is a positive-valued
piecewise constant function with $L$ levels if it is locally constant at any point $x \in [0,1]^d \setminus B(f)$,
where $B(f) \subset [0,1]^d$ is a boundary set satisfying $N(r) \leq \beta r^{-(d-1)}$ for all $r > 0$. Here, $\beta > 0$ is a
constant and $N(r)$ is the minimal number of balls of diameter $r$ that covers $B(f)$. Furthermore,
let $f$ be uniformly bounded on $[0,1]^d$, that is, $m \leq f(x) \leq M$ for all $x \in [0,1]^d$, where $m > 0$
and $M < \infty$. The set of all piecewise constant functions $f$ satisfying the above conditions is
denoted by $\mathrm{PC}(\beta, m, M, L)$.

In short, we consider $\mathrm{PC}(\beta, m, M, L)$ to be a class of likelihood-ratio quantization rules that
have well behaved boundaries. The theoretically optimal quantization rule is thus defined to
be

$$\psi^* := \arg\max_{\psi \in \mathrm{PC}(\beta, m, M, L)} D(\psi). \tag{11}$$

It is well known that $\psi^*$ can always be constructed by thresholding the likelihood ratio [18]. In
other words, the optimal quantization rule $\psi^*$ can always be chosen to be a piecewise constant
function whose boundary sets are level sets of the likelihood ratio $q(x)/p(x)$.
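
The thresholding construction can be sketched for scalar data as follows. The densities (zero-mean, unit-variance Gaussian for $p$ and Laplace for $q$) match the example used later in the paper, but the threshold values and the helper name `lr_quantizer` are purely illustrative assumptions; the sketch does not attempt to find the optimal thresholds.

```python
import bisect
import math

def gaussian_pdf(x):
    """Zero-mean, unit-variance Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def laplace_pdf(x):
    """Zero-mean, unit-variance Laplace density (scale b = 1/sqrt(2))."""
    b = 1.0 / math.sqrt(2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)

def lr_quantizer(thresholds):
    """Return a rule mapping x to the number of thresholds not exceeding q(x)/p(x)."""
    ts = sorted(thresholds)
    def rule(x):
        ratio = laplace_pdf(x) / gaussian_pdf(x)
        return bisect.bisect_right(ts, ratio)
    return rule

# Three hypothetical thresholds give a four-level rule whose cell
# boundaries are level sets of the likelihood ratio.
rule = lr_quantizer([0.8, 1.0, 1.5])
labels = [rule(x) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```

Because the ratio is symmetric and non-monotone in $|x|$, a single level of the rule can correspond to several disjoint intervals, which is one reason nearest neighbor style partitions can be a poor fit.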



3 Solving for the estimator 

To find $\phi_n$ in (9), we employ a modified form of the Flynn and Gray algorithm [19] that
iteratively maximizes the divergence between two pmfs over a set of quantization rules. The
method directly follows from the EDM formulation (although it was not originally proposed in
this context) and searches for an optimal cell labeling for a given partition where the number
of cells is much larger than the number of quantization levels.



3.1 The Flynn and Gray algorithm 



For independent and identically distributed random variables $X_1, \dots, X_n$, the empirical measure
of a set $A \subset [0,1]^d$, denoted $P_n(A)$, is the sample average

$$P_n(A) = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_A(X_k). \tag{12}$$

The sample average of a function $g : [0,1]^d \to \mathbb{R}$ can thus be written with respect to $P_n$ as an
empirical expectation,

$$P_n(g) = \frac{1}{n} \sum_{k=1}^{n} g(X_k) = \int g \, dP_n. \tag{13}$$

Using this notation, we rewrite (8) as

$$D_n(\phi) = 1 + \int_{[0,1]^d} \log(\phi) \, dP_n - \int_{[0,1]^d} \phi \, dQ_n, \tag{14}$$

where $\phi \in \Phi$. For any fixed partition $\pi_R \in \Pi_R$, $D_n(\phi)$ is maximized by assigning $\phi(x)$ the
values $P_n(R_i)/Q_n(R_i)$ for $x \in R_i$. For this assignment choice, $D_n(\phi)$ can be expressed as

$$D_n(\phi) = 1 + \sum_{i=0}^{L-1} \left[ \int_{R_i} \log\!\left(\frac{P_n(R_i)}{Q_n(R_i)}\right) dP_n - \int_{R_i} \frac{P_n(R_i)}{Q_n(R_i)} \, dQ_n \right], \tag{15}$$
and the estimator $\phi_n$ can now be found by searching over $\Pi_R$ for the partition that
maximizes (15). The Flynn and Gray algorithm is a straightforward method which accomplishes
this task. To apply it, we rewrite (15) as

$$D_n(\phi) = \sum_{i=0}^{L-1} P_n(R_i) \left[ \log\frac{P_n(R_i)}{Q_n(R_i)} + 1 \right] + Q_n(R_i) \left[ -\frac{P_n(R_i)}{Q_n(R_i)} \right] \tag{16}$$

$$= \sum_{i=0}^{L-1} P_n(R_i) a_i + Q_n(R_i) b_i \tag{17}$$

$$= \sum_{i=0}^{L-1} \sum_{k \in I_i} P_n(S_k) a_i + Q_n(S_k) b_i, \tag{18}$$

where $a_i = \log(P_n(R_i)/Q_n(R_i)) + 1$, $b_i = -P_n(R_i)/Q_n(R_i)$, and $I_i$ is the index set
$\{k : \mathrm{label}(S_k) = i\}$. The algorithm maximizes $D_n(\phi)$ by iterating two steps: it first holds the
set of weights $\{a_i\}$ and $\{b_i\}$ fixed and finds the labels for each cell $S_k \in \pi_J$, $k = 0, \dots, 2^{dJ} - 1$,
that maximize (18), and then holds the cell labels of $\pi_J$ fixed and updates the weights $\{a_i\}$ and
$\{b_i\}$ using the probabilities $P_n(R_i)$, $Q_n(R_i)$, $i = 0, \dots, L-1$, found from the first step. Flynn
and Gray showed these steps monotonically increase (18), and since $D_n(\phi)$ is upper bounded by
$1 - m + \log M$ (which follows from the boundedness of $\phi$), the algorithm converges to a local maximum.
The algorithm returns a locally optimal labeling of $\pi_J$ and locally optimal weights from which
$\phi_n$ can be determined:

$$\phi_n(x) = -b_i \quad \text{for } x \in R_i. \tag{19}$$

The algorithm is outlined in the panel entitled Algorithm 1. An advantage of the Flynn and
Gray algorithm is that it avoids the exhaustive combinatorial search over all possible labelings
by only needing to examine each cell $S_k$ once per iteration. From experiments, it has been
observed that for moderate sized partitions ttj (< 2^^ cells) and for L < 10, the algorithm 
converges very quickly (< 30 iterations). 
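
The two alternating steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the cell probabilities `P` and `Q` are taken to be strictly positive (for example after preloading), the stopping test uses an absolute tolerance `tol` rather than a relative threshold, and a region that becomes empty is simply never reused. The name `flynn_gray` and all parameter names are hypothetical.

```python
import math

def flynn_gray(P, Q, labels, L, tol=1e-12, max_iter=100):
    """Alternate cell relabeling and weight updates on cell probabilities P, Q.

    P, Q   : strictly positive empirical probabilities, one entry per cell S_k
    labels : initial label in {0, ..., L-1} for each cell
    Returns (labels, levels, divergence) with levels[i] = -b_i = P(R_i)/Q(R_i).
    """
    labels = list(labels)
    a = [-math.inf] * L  # an empty region can never win the argmax below
    b = [0.0] * L

    def update_weights():
        # Region masses for the current labeling, then a_i, b_i and the objective.
        PR, QR = [0.0] * L, [0.0] * L
        for k, i in enumerate(labels):
            PR[i] += P[k]
            QR[i] += Q[k]
        d = 0.0
        for i in range(L):
            if PR[i] > 0.0:
                a[i] = math.log(PR[i] / QR[i]) + 1.0
                b[i] = -PR[i] / QR[i]
                d += PR[i] * a[i] + QR[i] * b[i]
            else:
                a[i], b[i] = -math.inf, 0.0
        return d

    d = update_weights()
    for _ in range(max_iter):
        # Step 1: hold weights fixed, give each cell its best label.
        for k in range(len(P)):
            labels[k] = max(range(L), key=lambda i: P[k] * a[i] + Q[k] * b[i])
        # Step 2: hold labels fixed, recompute weights and the objective.
        d_new = update_weights()
        improved = d_new - d > tol
        d = d_new
        if not improved:
            break
    return labels, [-bi for bi in b], d

# Four cells, two levels, starting from an interleaved labeling.
out_labels, levels, d_final = flynn_gray(
    [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1], L=2)
```

Each pass touches every cell once, so the cost per iteration is proportional to the number of cells times $L$.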

EDM provides a new derivation for the Flynn and Gray algorithm; however, it is interesting
to note that it can also be based on the fact that

$$D_{KL}(p\|q) = \sup_{\gamma} D_{KL}(p(\gamma)\|q(\gamma)), \tag{20}$$

where the supremum is over all measurable quantization rules with an arbitrary number of
levels [20], [21]. Because $D_{KL}(p\|q) \geq D_{KL}(p(\gamma)\|q(\gamma))$ for any quantization rule, one could use (20)
to justify an approach similar to EDM and maximize $D_{KL}(p(\gamma)\|q(\gamma))$ over a set of
quantization rules for a fixed quantization level. While this approach leads to similar (if not identical)
estimators, EDM has the advantage of making a clear connection with empirical process theory
which provides the theoretical tools to analyze the error convergence rate.

Note also that the original Flynn and Gray algorithm does not explicitly constrain the
values of $\phi$ to lie within the range $[m, M]$. However, to avoid computing unbounded estimates
at any given iteration, we employ the K-T technique [22] when computing $P_n(S_k)$ and $Q_n(S_k)$,
$k = 0, \dots, 2^{dJ} - 1$. This technique simply preloads each cell $S_k$ by one half before calculating
the sample averages, thereby avoiding the possibility of computing zero probability estimates
$P_n(R_i)$, $Q_n(R_i)$. Thus instead of (12), one computes

PniSk) = ^.^^ + (21) 

and likewise for $Q_n(S_k)$. Here, the choice of 1/2 is not arbitrary; it is based on theoretical
considerations of what a priori distribution of the probabilities $P(S_k)$, $Q(S_k)$ influences the
sample averages $P_n(S_k)$, $Q_n(S_k)$ the least [22].
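
A minimal sketch of the preloading idea follows. It assumes the common add-one-half form, in which each of the $K$ cells receives a pseudo-count of 1/2 and the total is renormalized by $n + K/2$; this normalization is an assumption of the sketch and may differ from the exact expression used in the paper. The name `kt_cell_probabilities` is hypothetical.

```python
def kt_cell_probabilities(counts):
    """Add-one-half estimates from raw cell counts.

    counts[k] is the number of samples that fell in cell S_k.
    Assumed form: (count + 1/2) / (n + K/2), which is strictly positive
    for every cell and sums to one over the K cells.
    """
    n = sum(counts)
    K = len(counts)
    return [(c + 0.5) / (n + 0.5 * K) for c in counts]

# Three cells, one of which received no samples.
probs = kt_cell_probabilities([3, 0, 1])
```

With strictly positive cell estimates, every ratio $P_n(R_i)/Q_n(R_i)$ formed from nonempty regions is finite and nonzero.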

In short, one solves for an EDM estimator based upon the training data by
first computing the sample averages $P_n(S_k)$, $Q_n(S_k)$ for each cell $S_k \in \pi_J$ and then providing
these probabilities as input to the Flynn and Gray algorithm. The algorithm is applied to two
numerical examples in Section [3 

Algorithm 1 Modified Flynn and Gray algorithm
Input: $L$, $J$, $n$, empirical cell probabilities $P_n(S_k)$ and $Q_n(S_k)$, stopping threshold $\epsilon$
 1: Initialize iteration index $l = 0$
 2: Randomly label cells $S_k \in \pi_J$
 3: Compute $P_n^{(0)}(R_i)$, $Q_n^{(0)}(R_i)$, $i = 0, \dots, L-1$
 4: Initialize weights
      $a_i^{(0)} = \log(P_n^{(0)}(R_i)/Q_n^{(0)}(R_i)) + 1$
      $b_i^{(0)} = -P_n^{(0)}(R_i)/Q_n^{(0)}(R_i)$
 5: Compute $D_n^{(0)}(\phi)$
 6: Initialize intermediary divergence $\tilde{D}_n(\phi) = 0$
 7: while $(D_n^{(l)}(\phi) - \tilde{D}_n(\phi))/D_n^{(l)}(\phi) > \epsilon$ do
 8:   Find new label for each cell $S_k$ by computing
        $i^{(l+1)} = \arg\max_{i \in \{0, \dots, L-1\}} P_n(S_k) a_i^{(l)} + Q_n(S_k) b_i^{(l)}$
      (hold weights fixed)
 9:   Update probabilities for $i = 0, \dots, L-1$
        $P_n^{(l+1)}(R_i) = \sum_{k : \mathrm{label}(S_k) = i} P_n(S_k)$
        $Q_n^{(l+1)}(R_i) = \sum_{k : \mathrm{label}(S_k) = i} Q_n(S_k)$
      (includes K-T preloading if necessary)
10:   Compute intermediary divergence
        $\tilde{D}_n(\phi) = \sum_{i=0}^{L-1} P_n^{(l+1)}(R_i) a_i^{(l)} + Q_n^{(l+1)}(R_i) b_i^{(l)}$
11:   Update weights
        $a_i^{(l+1)} = \log(P_n^{(l+1)}(R_i)/Q_n^{(l+1)}(R_i)) + 1$
        $b_i^{(l+1)} = -P_n^{(l+1)}(R_i)/Q_n^{(l+1)}(R_i)$
12:   Compute new divergence
        $D_n^{(l+1)}(\phi) = \sum_{i=0}^{L-1} P_n^{(l+1)}(R_i) a_i^{(l+1)} + Q_n^{(l+1)}(R_i) b_i^{(l+1)}$
13:   $l = l + 1$
14: end while
Output: $\phi_n$ (locally optimal labels of $\pi_J$ and weights $\{a_i, b_i\}$), $D_n(\phi_n)$

3.2 Recursive dyadic partitions

Because $\Phi$ is based on a uniform dyadic partition, any EDM estimate $\phi_n$ can be viewed as
a piecewise constant function supported on a recursive dyadic partition (RDP). RDPs are a
systematic class of partitions that have proven to be effective in function estimation and
classification problems [17], [23]. Their usefulness stems from their ability to adapt to boundaries
(including those of functions in $\mathrm{PC}(\beta, m, M, L)$), thus allowing efficient computation of estimators and concise
encoding of estimates. In the present context, RDPs are important because they allow efficient
encoding of $\phi_n$, and their properties are key in the approximation error analysis presented in
Section 4.2.

Figure 1: An example two-dimensional RDP ($J = 3$).

RDPs are partitions composed of quasi-disjoint sets^2 whose union equals the entire space
$[0,1]^d$. An RDP is any partition that can be constructed using only the following rules [17]:

1. $\{[0,1]^d\}$ is an RDP.

2. Let $\pi = \{S_0, \dots, S_{k-1}\}$ be an RDP, where $S_i = [u_{i1}, v_{i1}] \times \dots \times [u_{id}, v_{id}]$. Then

$$\pi' = \{S_0, \dots, S_{i-1}, S_i^{(0)}, \dots, S_i^{(2^d - 1)}, S_{i+1}, \dots, S_{k-1}\}$$

is an RDP, where $\{S_i^{(0)}, \dots, S_i^{(2^d - 1)}\}$ is obtained by dividing the hypercube $S_i$ into $2^d$
quasi-disjoint hypercubes of equal size. Formally, let $q \in \{0, \dots, 2^d - 1\}$ and let $q = q_1 q_2 \cdots q_d$ be
the binary representation of $q$. Then

$$S_i^{(q)} = \left[ u_{i1} + \frac{v_{i1} - u_{i1}}{2} q_1, \; v_{i1} + \frac{u_{i1} - v_{i1}}{2} (1 - q_1) \right] \times \dots \times \left[ u_{id} + \frac{v_{id} - u_{id}}{2} q_d, \; v_{id} + \frac{u_{id} - v_{id}}{2} (1 - q_d) \right].$$

We say an RDP has maximal depth $J$ if the side length of its smallest hypercube equals $2^{-J}$.
Fig. 1 illustrates an RDP approximating an elliptical boundary.

It should be clear that RDPs describe tree structures where the root node is the entire space
$[0,1]^d$ and the leaf nodes represent the different cells comprising the RDP. Each branch can
have a different depth and thus the cells can have different sizes. This property allows an RDP to
have larger cells in locations where the function value is constant and smaller cells where the
values change (around boundaries). The combination of the systematic tree structure and the
partition's adaptivity allows the estimator to be efficiently encoded, that is, only a small number of bits
is needed to map an observation to its quantized value [23].

For a fixed estimator $\phi_n$ (or a fixed labeling of $\pi_J$), an RDP can be easily constructed by
repeating step 2 above (starting with the whole space), but only producing a split if the cells
$S_k \in \pi_J$ on either side of the split, but within the hypercube of interest, have different values
(labels).

^2 Two sets are quasi-disjoint if and only if their intersection has Lebesgue measure zero.
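
The construction just described can be sketched for the two-dimensional case. The sketch assumes the labeling of the uniform partition is given as a square grid of labels whose side length is a power of two; the function name `build_rdp` and the leaf representation `(x0, y0, size, label)` are illustrative choices, not from the paper.

```python
def build_rdp(labels, x0=0, y0=0, size=None):
    """Return the leaves (x0, y0, size, label) of an RDP for a labeled uniform grid.

    labels is a square list of lists with side length 2**J. A block is split
    into four equal sub-blocks only if it contains more than one label.
    """
    if size is None:
        size = len(labels)
    first = labels[y0][x0]
    uniform = all(labels[y][x] == first
                  for y in range(y0, y0 + size)
                  for x in range(x0, x0 + size))
    if uniform:
        return [(x0, y0, size, first)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves.extend(build_rdp(labels, x0 + dx, y0 + dy, half))
    return leaves

# A 4 x 4 labeling (J = 2) with a boundary passing through one quadrant.
grid = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
leaves = build_rdp(grid)
```

Here three quadrants are constant and stay as single leaves, while the fourth is split, so the 16 uniform cells are represented by 7 leaves.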

4 Error Decay Rates 

We gauge the quality of $\phi_n$ by characterizing the decay rate of the estimation and approximation
errors. Estimation error is defined as the difference $D(\phi^*) - D(\phi_n)$ and quantifies the error
caused by computing $\phi_n$ without knowledge of $p$ and $q$. As the number of samples $n$ increases,
the estimation error decreases at a rate (exponent of $n$) that depends on the complexity of $\Phi$
and on the properties of $p$ and $q$. Approximation error is defined as $D(\psi^*) - D(\phi^*)$ and arises
in cases where $\psi^* \notin \Phi$. To quantify its decay, we think of the candidate rules $\phi \in \Phi$ as being
supported on RDPs and express the rate of decay in terms of the depth parameter $J$.

We begin in a standard fashion with two basic inequalities that follow from the definitions
of $\phi^*$ and $\phi_n$: $D(\phi^*) - D(\phi_n) \geq 0$ and $D_n(\phi^*) - D_n(\phi_n) \leq 0$. They imply that the estimation
error is upper bounded by a difference of empirical processes,

$$0 \leq D(\phi^*) - D(\phi_n) \leq -(\nu_n(\phi^*) - \nu_n(\phi_n))/\sqrt{n},$$

where the second inequality results from adding and subtracting $D_n(\phi^*)$ and $D_n(\phi_n)$, and where
$\nu_n(\phi) = \sqrt{n}\,(D_n(\phi) - D(\phi))$. Adding the approximation error to both sides of the inequality
bounds the total error by the two component errors:

$$0 \leq \underbrace{D(\psi^*) - D(\phi_n)}_{\text{total error}} \leq \underbrace{-(\nu_n(\phi^*) - \nu_n(\phi_n))/\sqrt{n}}_{\text{upper bound on est. error}} + \underbrace{D(\psi^*) - D(\phi^*)}_{\text{approx. error}}. \tag{22}$$

We use this inequality and examine the estimation and the approximation errors separately,
giving the final rate result for the expected total error.

4.1 Estimation Error 

Let $E_P$ and $E_Q$ denote the expectation operators with respect to $P$ and $Q$. Then, by writing
$|\nu_n(\phi_n) - \nu_n(\phi^*)|$ as

$$\sqrt{n} \, \left| \frac{1}{n} \sum_{i=1}^{n} \left( \log\phi_n(X_i^p) - \log\phi^*(X_i^p) \right) - E_P[\log\phi_n(X) - \log\phi^*(X)] - \frac{1}{n} \sum_{i=1}^{n} \left( \phi_n(X_i^q) - \phi^*(X_i^q) \right) + E_Q[\phi_n(X) - \phi^*(X)] \right|,$$

it is clear that for a given quantization rule $\phi$, the empirical averages above converge almost
surely to their respective expected values by the strong law of large numbers. But because $\phi_n$ can
potentially be any element in $\Phi$, any characterization of the convergence rate must hold uniformly
over $\Phi$. It is well known that uniform rates of convergence depend on the complexity of the
function class from which the empirical estimators are drawn [24]. Here we use the notion of
bracketing entropy to characterize the complexity of $\Phi$. Roughly speaking, the bracketing entropy
of a function class $\mathcal{G}$ equals the logarithm of the minimum number of function pairs that upper
and lower bound (bracket) all the members in $\mathcal{G}$ to within some tolerance $\delta$ and with respect to
some norm (a precise definition can be found in [24, p. 16]). We denote the bracketing entropy
by $H_B(\delta, \mathcal{G}, L_2(P))$ and say $\mathcal{G}$ has bracketing complexity $\alpha > 0$ if $H_B(\delta, \mathcal{G}, L_2(P)) \leq A\delta^{-\alpha}$ for
all $\delta > 0$ and for some constant $A > 0$. Because the members of $\Phi$ are uniformly bounded, it
can be shown that $\Phi$ has bracketing complexity $\alpha = 1$ [25]. This fact is incorporated into
Theorem 1 below; however, the proof of the theorem given in Appendix A.1 assumes the bracketing
complexity lies between zero and two. The proof therefore yields a slightly more general result
than that stated.

We now introduce two conditions on $p$ and $q$. The first simply states that $p$ and $q$ are
uniformly bounded.

Condition 1. Assume $c \leq p(x), q(x) \leq C$ for all $x \in [0,1]^d$, $c > 0$, $C < \infty$.

The second is a condition introduced by Mammen and Tsybakov [26], [27] and involves a
key parameter $\kappa$ that provides insight into when fast convergence rates are possible (i.e., rates
faster than $n^{-1/2}$). The condition arises in a slightly different form in function estimation and
Bayesian classification problems, and within these contexts, it can be related to the behavior of
$p$ and $q$ near a boundary of interest^3. For example, in Bayesian classification, small $\kappa$ implies
a "steep" regression function at the Bayes decision boundary and thus easier classification;
large $\kappa$ implies a "flat" transition and harder classification. Van de Geer [2] describes $\kappa$ as an
"identifiability" parameter in the sense that it characterizes how well $\phi \in \Phi$ can be distinguished
from $\psi^*$. Because $\psi^*$ is determined by $P$ and $Q$, this condition is ultimately a condition on
these underlying distributions.

Condition 2. There exist constants $K > 0$ and $\kappa \geq 1$ such that for all $\phi \in \Phi$,

$$D(\psi^*) - D(\phi) \geq \|\psi^* - \phi\|_{L_2}^{\kappa} / K. \tag{23}$$

If $\kappa$ is small (close to 1) for a given $P$ and $Q$, the difference $D(\psi^*) - D(\phi)$ is larger for
$\phi \in \Phi$ close to $\psi^*$ (those $\phi$ such that $\|\psi^* - \phi\|_{L_2} < 1$) compared to those distributions having
larger $\kappa$ values. Intuitively, this means that such $\phi$ are more distinguishable from $\psi^*$ for those
distributions satisfying Condition 2 with small $\kappa$ compared to those distributions satisfying
Condition 2 with larger $\kappa$ values, where distinguishability is measured in terms of divergence
loss $D(\psi^*) - D(\phi)$^4. The following result shows that $\kappa$ effectively characterizes this aspect of
the problem and, as in Bayesian classification and estimation, is a key parameter for the error
convergence rate.

Theorem 1 (Estimation error). Let $\phi_n$, $\phi^*$, and $\psi^*$ be as defined in (9), (10), and (11),
respectively. Suppose Conditions 1 and 2 are met for some constants $c$, $C$, $K$, and $\kappa$. Then for any
$0 < \epsilon < 1$ we have

D(V^*)-ED(0„)< 

const(c, C, K, k) n~3^ + D{i^*) - D{p*) , 

for sufficiently large $n$, where $\mathrm{const}(c, C, K, \kappa)$ is a decreasing function of $\epsilon$.

^3 In Bayesian classification and function estimation, the condition is known as the margin condition.

^4 Note that the distinguishability in terms of divergence loss is intimately connected with how $\phi_n$ is computed:
since $D_n(\phi)$ is a surrogate of $D(\phi)$, maximizing $D_n(\phi)$ over $\phi \in \Phi$ is a surrogate for maximizing $D(\phi)$ over $\phi \in \Phi$,
or equivalently minimizing $D(\psi^*) - D(\phi)$.
Figure 2: Left: Total expected divergence loss $D(\psi^*) - E D(\phi_n)$ plotted as a function of the
number of training samples $n$ (solid curve) for the case $L = 8$, $J = 6$ where $p(x)$ is a zero-mean
unit variance Gaussian density and $q(x)$ is a zero-mean unit variance Laplace density. The
dashed (black) curve is $O(n^{-1})$; thus for this example, we have a fast rate of decay. Right:
Probability density functions.



The proof is provided in Appendix A.1 (see also [28]) and directly follows from results of
van de Geer [2, pp. 206-207], [21] and Mammen and Tsybakov [26].

Depending on $P$ and $Q$, the decay rate of the estimation error can be as fast as $n^{-1}$ ($\kappa = 1$)
and no worse than $n^{-1/2}$ ($\kappa = \infty$). In particular, if for a given $P$ and $Q$, the approximation
error is nonzero (which is commonly the case in quantization problems), we have

$$D(\psi^*) - D(\phi) \geq D(\psi^*) - D(\phi^*) \geq \mathrm{const} \cdot \|\psi^* - \phi\|_{L_2},$$

where the first inequality follows from the definition of $\phi^*$ and the second from the fact that
$\|\psi^* - \phi\|_{L_2}$ is uniformly bounded for all $\phi \in \Phi$. Thus with a nonzero approximation error, Condition 2 can
be met with $\kappa = 1$ and the rate $n^{-1}$ is achievable. This situation is common in quantization
problems because it is unusual in practice for the level sets of a likelihood ratio function to
coincide with an RDP for a fixed depth $J$. For example, consider the simple scenario where $p(x)$
is a zero-mean unit variance Gaussian density function and $q(x)$ is a Laplace density function
(also zero-mean and unit variance), $x \in \mathbb{R}$. With $L = 8$ and $J = 6$, the approximation error
is 0.002 and hence we expect a rate of $O(n^{-1})$. Fig. 2 confirms this result experimentally by
plotting $D(\psi^*) - E D(\phi_n)$ as a function of $n$. (For each value of $n$, Gaussian and Laplacian data
were generated and $\phi_n$ was computed using the Flynn and Gray algorithm.) The black dashed
curve on the left hand plot is shown for reference and equals $40 n^{-1} + 0.002$.

In contrast, if $P$ and $Q$ are such that $\psi^* \in \Phi$, then Condition 2 is only met with $\kappa = 2$,
and therefore the resulting decay rate is $O(n^{-2/3})$. To derive this result, consider the subset of
quantization rules $\phi \in \Phi$ that share the same partition $\{R_i\}_{i=0}^{L-1}$ associated with $\psi^*$. Recalling (4), we
can in this case write $\psi^*$ as

$$\psi^*(x) = \sum_{i=0}^{L-1} \psi_i^* \mathbf{1}_{R_i}(x),$$

where $\{\psi_i^*\}$ are the levels of $\psi^*$. Then by (10) and the fact that $\psi_i^* = P(R_i)/Q(R_i)$, we have

$$D(\psi^*) = 1 + \int \log(\psi^*) \, dP - \int \psi^* \, dQ = \int \log\psi^* \, dP = \int \psi^* \log(\psi^*) \, dQ = \sum_{i=0}^{L-1} \int_{R_i} \psi_i^* \log\psi_i^* \, dQ.$$

Now, because $(\cdot)\log(\cdot)$ is continuous and differentiable on the positive real numbers,
we can by Taylor's Theorem [29] expand $\psi_i^* \log \psi_i^*$ on $R_i$ around $c_{R_i}$ for each $i$ to obtain

$$D(\psi^*) = \sum_{i=0}^{L-1} \int_{R_i} \left[ c_{R_i} \log c_{R_i} + (1 + \log c_{R_i})(\psi_i^* - c_{R_i}) + \mathcal{R}_i \right] dQ, \tag{24}$$

where $\mathcal{R}_i = \frac{1}{2\xi_i}(\psi_i^* - c_{R_i})^2$, $i = 0, \dots, L-1$, are the Taylor remainders of the expansions with $\xi_i$
lying in between $\psi_i^*$ and $c_{R_i}$. By adding and subtracting $\int \log(\phi) \, dP$, (24) can be rearranged
to yield

$$D(\psi^*) - D(\phi) = \sum_{i=0}^{L-1} \int_{R_i} \mathcal{R}_i \, dQ$$

$$\geq \frac{c}{2M} \sum_{i=0}^{L-1} \int_{R_i} (\psi_i^* - c_{R_i})^2 \, dx \tag{25}$$

$$= \frac{c}{2M} \|\psi^* - \phi\|_{L_2}^2, \tag{26}$$

where the inequality follows from replacing $\xi_i$ with $M$ in the remainder term and using the
fact that $q$ is lower bounded by $c$ (Condition 1). Thus, when there is no approximation error for
the given distributions $P$ and $Q$ (and for given values of $J$, $L$, $m$, and $M$), the best guaranteed
convergence rate is $O(n^{-2/3})$. Intuitively, this is reasonable since among those $\phi$ that share
the same partition as $\psi^*$, it is harder to distinguish $\psi^*$ compared to the case where there is a
nonzero approximation error.
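
The quadratic behavior of the divergence loss on a shared partition can be checked numerically. The sketch below uses hypothetical cell masses and levels and verifies the intermediate bound $D(\psi^*) - D(\phi) \geq \frac{1}{2M}\sum_i Q(R_i)(\psi_i^* - c_{R_i})^2$, that is, the Taylor remainder with $\xi_i$ replaced by $M$ and before $q$ is lower bounded by $c$. The names `divergence`, `psi`, and `phi` are illustrative.

```python
import math

def divergence(P, Q, levels):
    """D(phi) = 1 + sum_i [P_i log c_i - c_i Q_i] for a rule with level c_i on cell R_i."""
    return 1.0 + sum(p * math.log(c) - c * q for p, q, c in zip(P, Q, levels))

# Hypothetical masses of two cells R_0, R_1 under P and Q.
P = [0.2, 0.8]
Q = [0.5, 0.5]

psi = [p / q for p, q in zip(P, Q)]   # optimal levels on this partition
phi = [0.5, 1.5]                      # another rule on the same partition
M = 2.0                               # upper bound on all levels involved

gap = divergence(P, Q, psi) - divergence(P, Q, phi)
bound = sum(q * (s - c) ** 2 for q, s, c in zip(Q, psi, phi)) / (2.0 * M)
```

The gap is positive and dominates the quadratic bound, consistent with the exponent 2 in the squared $L_2$ norm.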

Nguyen, Wainwright and Jordan reported a similar result to Theorem 1 in [3]. In their
investigation, they used an empirical estimator of the same form as (9), but did not
consider quantization, nor did they incorporate a margin condition like Condition 2 into their
formulation. They considered a class of (inverse) likelihood ratio functions $\mathcal{F}$ that satisfies a
complexity condition like Condition 1 and found that the difference $D(f^*) - D_n(f_n)$ decays as
$O(n^{-1/(2+\alpha)})$, where $D_n(\cdot)$ and $D(\cdot)$ are as defined in (8) and (10), $f^* \in \mathcal{F}$ is the best in class
likelihood ratio function, and $f_n$ is an empirical estimator similar to (9). Note that this rate
is strictly slower than the rate in Theorem 1 even if $\kappa$ is eliminated from the formulation (take
$\kappa \to \infty$).



4.2 Approximation Error 

The approximation error analysis also requires that we now think of $\Phi$ as a class of piecewise
constant functions (quantization rules) supported on RDPs. As discussed in Section 3.2, this
is fully consistent with the definition given in (7). With this in mind, we have the result:

Theorem 2 (Approximation error). Let $\Phi(L, J, m, M)$, $\phi^*$, and $\psi^*$ be as defined in (7), (10),
and (11), respectively. Suppose that Condition 1 is met for some constants $c$ and $C$. Then the
approximation error is bounded as

$$D(\psi^*) - D(\phi^*) \le \mathrm{const}(\beta, c, C, m, M, L)\, 2^{-J}. \qquad (27)$$

The proof of this result is given in Appendix A.2 (see also [30]). It follows a related function
estimation result in [17] with one important exception: the KL divergence is not additive; thus,
unlike a mean squared error metric, the approximation error $D(\psi^*) - D(\phi^*)$ cannot be quantified
cell by cell. Details are provided in the proof.

The combination of Theorem 1 and Theorem 2 gives the decay rate of the total expected
error in terms of the number of training samples $n$ and the depth $J$ of the uniform dyadic
partition $\pi_J$. To balance the errors and obtain a rate only in terms of $n$, one can express $J$ as
a function of $n$. Setting $J = \lceil \kappa \ln n / ((2\kappa - 1) \ln 2) \rceil$ yields the final result

$$D(\psi^*) - \mathbb{E} D(\hat{\phi}_n) \le \mathrm{const} \cdot n^{-\kappa/(2\kappa - 1)}, \qquad (28)$$

for sufficiently large $n$.
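To make the depth choice concrete, the sketch below computes $J$ from $n$ and $\kappa$ and compares $2^{-J}$ with $n^{-\kappa/(2\kappa-1)}$. It assumes the bracket in the depth formula denotes rounding up; the function names `balanced_depth` and `target_rate` are illustrative and do not come from the paper.

```python
import math

def balanced_depth(n, kappa):
    """Depth J = ceil(kappa * ln(n) / ((2*kappa - 1) * ln 2)); rounding up is an assumption."""
    return math.ceil(kappa * math.log(n) / ((2 * kappa - 1) * math.log(2)))

def target_rate(n, kappa):
    """The rate n^(-kappa / (2*kappa - 1)) that 2^(-J) is balanced against."""
    return n ** (-kappa / (2 * kappa - 1))

for kappa in (1, 2):
    for n in (10**4, 10**6):
        J = balanced_depth(n, kappa)
        print(f"kappa={kappa}, n={n}: J={J}, 2^-J={2.0**-J:.3e}, rate={target_rate(n, kappa):.3e}")
```

With $\kappa = 1$ the depth grows like $\log_2 n$ and the rate is $n^{-1}$; with $\kappa = 2$ the depth grows like $(2/3)\log_2 n$ and the rate is $n^{-2/3}$.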



5 Application: Quantization under communication constraints 

When signals are measured and digitized at one location but processed at another, communica- 
tion of the data is necessary. Because of ever present power, computing, and rate constraints, 
the raw data cannot be transmitted in full fidelity; instead a summary of the data is sent. 
When the ultimate goal is classification or detection, one strategy to maximize performance 
and minimize communication costs is to heavily quantize the data such that the KL divergence 
is maximized. This is perhaps the simplest strategy and hence attractive when communications 
are severely constrained. Optimal likelihood-ratio partitions can be very different from typical 
nearest neighbor (Voronoi) partitions that are associated with quantizers designed to minimize 
mean squared error (see Figs. 3 and 4). Nevertheless, past work in quantization for classification
has forced a small-cell property in the design strategy, resulting in partitions resembling nearest
neighbor partitions [H]. Consequently, optimal partitions with disjoint regions, for example,
cannot be well-approximated by these methods. The EDM quantization method overcomes this 
shortcoming. 

As an illustration, we consider $P$ to be a zero-mean bivariate Gaussian distribution and $Q$
to be a zero-mean bivariate Laplace distribution, both with identity correlation matrices; $P$ and
$Q$ thus differ only in their basic shapes. The plot of the likelihood ratio in Fig. 3 shows that
the boundaries of the optimal likelihood-ratio partition are concentric circles in each quadrant.
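The circular level sets can be verified directly. The sketch below is an illustration under one stated assumption: the bivariate Laplace distribution is taken to be the product of two independent unit-variance Laplace marginals (scale $1/\sqrt{2}$), which the text does not spell out. Under that assumption, completing the square gives $p(x, y)/q(x, y) = \pi^{-1} \exp(2 - r^2/2)$ with $r^2 = (|x| - \sqrt{2})^2 + (|y| - \sqrt{2})^2$, so the level sets are circles centered at $(\pm\sqrt{2}, \pm\sqrt{2})$.

```python
import math

SQRT2 = math.sqrt(2.0)

def gaussian_pdf(x, y):
    """Zero-mean bivariate Gaussian density with identity covariance."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

def laplace_pdf(x, y):
    """Product of two unit-variance Laplace densities (assumed form of Q)."""
    return 0.5 * math.exp(-SQRT2 * (abs(x) + abs(y)))

def likelihood_ratio(x, y):
    return gaussian_pdf(x, y) / laplace_pdf(x, y)

def closed_form(x, y):
    """Same ratio after completing the square in |x| and |y|."""
    r2 = (abs(x) - SQRT2) ** 2 + (abs(y) - SQRT2) ** 2
    return math.exp(2.0 - 0.5 * r2) / math.pi

# The ratio is constant on a circle of radius 0.5 around (sqrt(2), sqrt(2)).
radius = 0.5
values = []
for k in range(8):
    angle = 2.0 * math.pi * k / 8
    x = SQRT2 + radius * math.cos(angle)
    y = SQRT2 + radius * math.sin(angle)
    values.append(likelihood_ratio(x, y))
print(min(values), max(values))
```

Because the ratio depends on $|x|$ and $|y|$ only, the same circles appear mirrored in all four quadrants, which is why a single Voronoi-style cell cannot capture one level of the optimal partition.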



Fig. 4(a) depicts the best-in-class quantization rule along with its associated RDP in Fig. 4(b).
The result was generated with the Flynn and Gray algorithm but with $P_n(S_k)$ and $Q_n(S_k)$
in Algorithm 1 replaced by $P(S_k)$ and $Q(S_k)$. (Data points lying outside of $[-5, 5]^2$ were
simply ignored.) Convergence occurred in 8 iterations. Fig. 4(c) shows the empirical estimator
generated from training sets each of size two million samples. In this case, the Flynn and
Gray algorithm converged in 11 iterations.

In comparison to the best-in-class quantization rule, Fig. 4 shows the effect of trying
to estimate $P$ and $Q$ on $\pi_J$ for low probability regions (corner regions). In other words, the
lack of data within these regions makes approximating $P$ and $Q$ on $\pi_J$ difficult, especially by
empirical averages. More sophisticated density estimation methods, such as kernel-based methods,
would improve this aspect of the estimator. The estimator might also be improved if one
approximates $P$ and $Q$ on a (data-dependent) RDP instead of on $\pi_J$ (see e.g., [Ij).

Figure 3: Plot of a likelihood-ratio function of a zero-mean bivariate Gaussian probability 
density function and a zero-mean bivariate Laplace probability density function. The level sets 
of this function, which are concentric circles centered in the quadrants, form the boundaries 
of an optimal likelihood-ratio partition. Such partitions are not well-approximated by optimal 
nearest neighbor partitions. 

6 Conclusion 

In summary, EDM quantization provides a means of finding quantization rules, or more gener- 
ally, low dimensional transformations that best preserve the divergence between two hypothe- 
sized distributions. EDM estimators can be computed using the Flynn and Gray algorithm, and 
they can exhibit fast error convergence rates as a function of the number of training samples. 
The EDM formulation benefits from its connection to empirical process theory and possesses 
the flexibility to overcome the necessity of a small-cell constraint and allow efficient encoding. 
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A Appendices 

A.1 Proof of Theorem 1

The proof proceeds by considering the behavior of $|\nu_n(\hat{\phi}_n) - \nu_n(\phi^*)|$ from (22) as a function of
the $L_2$ distance between $\hat{\phi}_n$ and $\phi^*$. This is done by considering a weighted empirical process
for two different cases depending on the value of $\|\hat{\phi}_n - \phi^*\|_{L_2}$. The different cases yield different
rates of convergence; thus they must be treated separately. At the heart of the argument is
Lemma 3, a concentration inequality by van de Geer [23, Lemma 5.13] concerning the
supremum of weighted empirical processes (suprema are considered because we want uniform
convergence). The application of this result is not trivial; hence most of the proof is geared
toward formulating the problem properly. Most of this is done in Case 1 of the proof and Lemma
2. Case 1 is a slight modification of a proof found in [21, pp. 206-207]; Lemma 2 is original. For
more information regarding empirical process theory see [21]. Lastly, as stated in Section 4.1,
the proof only requires that the bracketing complexity of $\Phi$ satisfy $0 < \alpha < 2$.

[Figure 4 panels: (a) Best-in-class rule; (b) Associated best-in-class RDP; (c) EDM estimator; (d) Associated estimator RDP]

Figure 4: Best-in-class and EDM quantization rules, and their associated recursive dyadic 
partitions when P is bivariate Gaussian and Q is bivariate Laplace, L = 4, J = 6. Note that 
the different cell labelings (colors) are inconsequential in terms of the divergence. 
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Define the random variable 



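$$Z_n = \frac{|\nu_n(\hat{\phi}_n) - \nu_n(\phi^*)|}{C^{(\alpha+2)/4} \|\hat{\phi}_n - \phi^*\|_{L_2}^{1-\alpha/2} \vee n^{-(2-\alpha)/(2(2+\alpha))}}, \qquad (29)$$

where $x \vee y = \max(x, y)$. We consider the following two cases: (1) $\|\hat{\phi}_n - \phi^*\|_{L_2} > C^{-1/2} n^{-1/(2+\alpha)}$
and (2) $\|\hat{\phi}_n - \phi^*\|_{L_2} \le C^{-1/2} n^{-1/(2+\alpha)}$.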



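Case 1. Under this case (29) simplifies to

$$Z_n = \frac{|\nu_n(\hat{\phi}_n) - \nu_n(\phi^*)|}{C^{(\alpha+2)/4} \|\hat{\phi}_n - \phi^*\|_{L_2}^{\beta}}, \qquad (30)$$

where $\beta = 1 - \alpha/2$. For $\hat{\phi}_n = \phi^*$, (30) is defined to be zero. Recalling the inequality (22), we
have

$$D(\psi^*) - D(\hat{\phi}_n) \le Z_n C^{(\alpha+2)/4} \|\hat{\phi}_n - \phi^*\|_{L_2}^{\beta}\, n^{-1/2} + D(\psi^*) - D(\phi^*). \qquad (31)$$

Condition 2 implies

$$\|\psi^* - \phi^*\|_{L_2}^{\beta} \le K^{\beta/\kappa} (D(\psi^*) - D(\phi^*))^{\beta/\kappa}. \qquad (32)$$

Hence, by the triangle inequality and (32), we have

$$\|\hat{\phi}_n - \phi^*\|_{L_2}^{\beta} \le \|\psi^* - \hat{\phi}_n\|_{L_2}^{\beta} + \|\psi^* - \phi^*\|_{L_2}^{\beta} \le K^{\beta/\kappa} (D(\psi^*) - D(\hat{\phi}_n))^{\beta/\kappa} + K^{\beta/\kappa} (D(\psi^*) - D(\phi^*))^{\beta/\kappa}. \qquad (33)$$

Using (33) in (31) shows $D(\psi^*) - D(\hat{\phi}_n)$ is less than or equal to

$$\left[ C^{(\alpha+2)/4} K^{\beta/\kappa} n^{-1/2} Z_n (D(\psi^*) - D(\hat{\phi}_n))^{\beta/\kappa} + C^{(\alpha+2)/4} K^{\beta/\kappa} n^{-1/2} Z_n (D(\psi^*) - D(\phi^*))^{\beta/\kappa} \right] + D(\psi^*) - D(\phi^*).$$

We now apply Lemma 1 to each of the terms within the brackets to obtain

$$D(\psi^*) - D(\hat{\phi}_n) \le \epsilon \left[ (D(\psi^*) - D(\hat{\phi}_n)) + (D(\psi^*) - D(\phi^*)) \right] + 2 C^{\frac{(\alpha+2)\kappa}{4(\kappa-\beta)}} \left( \frac{K}{\epsilon} \right)^{\frac{\beta}{\kappa-\beta}} Z_n^{\frac{\kappa}{\kappa-\beta}}\, n^{-\frac{\kappa}{2(\kappa-\beta)}} + D(\psi^*) - D(\phi^*).$$

By rearranging the previous expression and dropping a factor of $(1+\epsilon)^{-1} < 1$, we have

$$D(\psi^*) - D(\hat{\phi}_n) \le \frac{1+\epsilon}{1-\epsilon} \left[ 2 C^{\frac{(\alpha+2)\kappa}{4(\kappa-\beta)}} \left( \frac{K}{\epsilon} \right)^{\frac{\beta}{\kappa-\beta}} Z_n^{\frac{\kappa}{\kappa-\beta}}\, n^{-\frac{\kappa}{2(\kappa-\beta)}} + D(\psi^*) - D(\phi^*) \right]. \qquad (34)$$

For any $r > 0$ we have by Jensen's inequality and Lemma 2 that

(35)

Taking the expectation of (34) and applying (35), we conclude

$$D(\psi^*) - \mathbb{E} D(\hat{\phi}_n) \le \frac{1+\epsilon}{1-\epsilon} \left[ \mathrm{const} \cdot n^{-\frac{\kappa}{2(\kappa-\beta)}} + D(\psi^*) - D(\phi^*) \right]$$

for $\|\hat{\phi}_n - \phi^*\|_{L_2} > C^{-1/2} n^{-1/(2+\alpha)}$.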
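Case 2. For this case,

$$Z_n = \frac{|\nu_n(\hat{\phi}_n) - \nu_n(\phi^*)|}{n^{-(2-\alpha)/(2(2+\alpha))}}.$$

From the fundamental inequality (22),

$$D(\psi^*) - D(\hat{\phi}_n) \le |\nu_n(\hat{\phi}_n) - \nu_n(\phi^*)|\, n^{-1/2} + D(\psi^*) - D(\phi^*) \le Z_n\, n^{-(2-\alpha)/(2(2+\alpha))}\, n^{-1/2} + D(\psi^*) - D(\phi^*). \qquad (36)$$

Taking the expectation of (36) and applying Lemma 2 yields

$$D(\psi^*) - \mathbb{E} D(\hat{\phi}_n) \le C_3\, n^{-2/(2+\alpha)} + D(\psi^*) - D(\phi^*), \qquad (37)$$

for $\|\hat{\phi}_n - \phi^*\|_{L_2} \le C^{-1/2} n^{-1/(2+\alpha)}$. The rate attained in (37) is (strictly) faster than that attained
in Case 1 ($n^{-2/(2+\alpha)} < n^{-\kappa/(2(\kappa-1)+\alpha)}$ for $\kappa > 1 - \alpha/2$, $0 < \alpha < 2$). Therefore, the decay of the
total divergence loss is governed by the slower rate found in Case 1. ■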

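Lemma 1 (Tsybakov and van de Geer [31], van de Geer [2j). We have for all positive $v$, $t$, and $\epsilon$,
and $\kappa > \beta$,

$$v t^{\beta/\kappa} \le \epsilon t + v^{\kappa/(\kappa-\beta)} \epsilon^{-\beta/(\kappa-\beta)}.$$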
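As a sanity check on Lemma 1, the sketch below evaluates both sides on a small grid. It assumes the lemma reads $v t^{\beta/\kappa} \le \epsilon t + v^{\kappa/(\kappa-\beta)} \epsilon^{-\beta/(\kappa-\beta)}$ for positive $v$, $t$, $\epsilon$ and $\kappa > \beta > 0$, which is the form implied by Young's inequality with exponents $\kappa/\beta$ and $\kappa/(\kappa-\beta)$; the grid values are arbitrary.

```python
from itertools import product

def lemma1_gap(v, t, eps, kappa, beta):
    """Right-hand side minus left-hand side of the assumed inequality."""
    lhs = v * t ** (beta / kappa)
    rhs = eps * t + v ** (kappa / (kappa - beta)) * eps ** (-beta / (kappa - beta))
    return rhs - lhs

grid = [0.1, 0.5, 1.0, 3.0, 10.0]
exponent_pairs = [(1.0, 0.5), (2.0, 0.5), (2.0, 1.5), (5.0, 0.9)]  # (kappa, beta), kappa > beta

worst = min(
    lemma1_gap(v, t, eps, kappa, beta)
    for v, t, eps in product(grid, repeat=3)
    for kappa, beta in exponent_pairs
)
print(f"smallest gap over the grid: {worst:.6g}")
```

A nonnegative smallest gap is consistent with the inequality; a grid check is of course not a proof.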

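Lemma 2. Let $\hat{\phi}_n$, $\phi^*$, and $\psi^*$ be as defined in (9), (10), and (11), respectively. Then under
Conditions 1 and 2, we have

$$\mathbb{E} \left( \sup_{\phi \in \Phi:\ \|\phi - \phi^*\|_{L_2} > C^{-1/2} n^{-1/(2+\alpha)}} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} \right)^r \le C_2^r \qquad (38)$$

and

$$\mathbb{E} \sup_{\phi \in \Phi:\ \|\phi - \phi^*\|_{L_2} \le C^{-1/2} n^{-1/(2+\alpha)}} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{n^{-(2-\alpha)/(2(2+\alpha))}} \le C_3, \qquad (39)$$

for some positive constants $C_2$, $r$, and $C_3$.

Proof. Case 1: Equation (38). To compact notation, let $a$ denote the inequality
$\|\phi - \phi^*\|_{L_2} > C^{-1/2} n^{-1/(2+\alpha)}$. By the definition of the empirical process,

$$|\nu_n(\phi) - \nu_n(\phi^*)| = \sqrt{n} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) - \int (\phi - \phi^*)\, d(Q_n - Q) \right|.$$

By Condition 1, we have

$$\|\phi - \phi^*\|_{L_2(P)} \le C^{1/2} \|\phi - \phi^*\|_{L_2}, \qquad \|\phi - \phi^*\|_{L_2(Q)} \le C^{1/2} \|\phi - \phi^*\|_{L_2}, \qquad (40)$$

for $\phi \in \Phi$. Consequently, we can write

$$\sup_{\phi \in \Phi:\, a} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} \le \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) \right|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} + \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\phi - \phi^*)\, d(Q_n - Q) \right|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}}$$

$$\le \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) \right|}{C^{(\alpha-2)/4} \|\phi - \phi^*\|_{L_2(P)}^{1-\alpha/2}} + \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\phi - \phi^*)\, d(Q_n - Q) \right|}{C^{(\alpha-2)/4} \|\phi - \phi^*\|_{L_2(Q)}^{1-\alpha/2}}, \qquad (41)$$

where the last inequality follows from (40).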

We now want to apply a probability inequality due to van de Geer [23] (stated as Lemma 3
below) to the two terms in (41). The result requires $\Phi$ and $\tilde{\Phi} = \{\log \phi : \phi \in \Phi\}$ to have a
bracketing complexity satisfying $0 < \alpha < 2$. $\Phi$ satisfies the requirement by construction, which
implies the same is true for $\tilde{\Phi}$. The result also requires that the differences $(\log \phi - \log \phi^*)$ and
$(\phi - \phi^*)$ are upper bounded. This follows from the definition of $\Phi$. Furthermore, note that the
proper form of the condition under the supremum follows from (40).

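Applying Lemma 3 to each term in (41), we obtain

$$\Pr \left( \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) \right|}{\|\phi - \phi^*\|_{L_2(P)}^{1-\alpha/2}} \ge C^{(\alpha-2)/4}\, t \right) \le c \exp \left( -\frac{t}{c^2} \right)$$

and

$$\Pr \left( \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\phi - \phi^*)\, d(Q_n - Q) \right|}{\|\phi - \phi^*\|_{L_2(Q)}^{1-\alpha/2}} \ge C^{(\alpha-2)/4}\, t \right) \le c \exp \left( -\frac{t}{c^2} \right)$$

for all $t \ge c$, some constant $c > 0$, and $n$ sufficiently large. Consequently,

$$\mathbb{E} \left( \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) \right|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} \right)^r \le C_{2,1}$$

and

$$\mathbb{E} \left( \sup_{\phi \in \Phi:\, a} \frac{\sqrt{n} \left| \int (\phi - \phi^*)\, d(Q_n - Q) \right|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} \right)^r \le C_{2,2}$$

for all $r > 0$ and some finite positive constants $C_{2,1}$ and $C_{2,2}$. Therefore

$$\mathbb{E} \left( \sup_{\phi \in \Phi:\, a} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} \right)^r \le C_2^r,$$

for some finite positive constant $C_2$.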



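Case 2: Equation (39). Let $b$ denote the inequality $\|\phi - \phi^*\|_{L_2} \le C^{-1/2} n^{-1/(2+\alpha)}$. From the
definition of the empirical process $|\nu_n(\phi) - \nu_n(\phi^*)|$, we have

$$\sup_{\phi \in \Phi:\, b} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{n^{-(2-\alpha)/(2(2+\alpha))}} \le \sup_{\phi \in \Phi:\, b} \frac{\sqrt{n} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) \right|}{n^{-(2-\alpha)/(2(2+\alpha))}} + \sup_{\phi \in \Phi:\, b} \frac{\sqrt{n} \left| \int (\phi - \phi^*)\, d(Q_n - Q) \right|}{n^{-(2-\alpha)/(2(2+\alpha))}}. \qquad (42)$$

Apply Lemma 3 to each of the terms in (42) to get

$$\Pr \left( \sup_{\phi \in \Phi:\, b} \left| \int (\log \phi - \log \phi^*)\, d(P_n - P) \right| \ge t\, n^{-2/(2+\alpha)} \right) \le c \exp \left( -\frac{t\, n^{\alpha/(2+\alpha)}}{c^2} \right)$$

and

$$\Pr \left( \sup_{\phi \in \Phi:\, b} \left| \int (\phi - \phi^*)\, d(Q_n - Q) \right| \ge t\, n^{-2/(2+\alpha)} \right) \le c \exp \left( -\frac{t\, n^{\alpha/(2+\alpha)}}{c^2} \right)$$

for all $t \ge c$ and for $n$ sufficiently large. Therefore, for $n$ sufficiently large,

$$\mathbb{E} \sup_{\phi \in \Phi:\, b} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{n^{-(2-\alpha)/(2(2+\alpha))}} \le C_3$$

for some positive constant $C_3$.

□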



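Lemma 3 (van de Geer [23], Lemma 5.13). Let $X_1, \ldots, X_n$ be an independent and identically
distributed sequence of random variables on a probability space $(\mathcal{X}, \mathcal{A}, P)$. Let $\mathcal{G} \subset L_2(P)$ be a
collection of functions and define the empirical process indexed by $\mathcal{G}$ as

$$\left\{ \nu_n(g) = \sqrt{n} \int g\, d(P_n - P) : g \in \mathcal{G} \right\}.$$

Let $|g|_\infty = \sup_{x \in \mathcal{X}} |g(x)|$ denote the supremum norm and suppose $\sup_{g \in \mathcal{G}} |g - g_0|_\infty \le K$, for
some fixed element $g_0 \in \mathcal{G}$ and some constant $K$. Furthermore, suppose

$$H_B(\delta, \mathcal{G}, L_2(P)) \le A \delta^{-\rho}, \quad \text{for all } \delta > 0,$$

for some $0 < \rho < 2$ and some constant $A > 0$. Then for some constant $c$ depending on $\rho$ and
$A$, we have for all $t \ge c$ and for $n$ sufficiently large,

$$\Pr \left( \sup_{g \in \mathcal{G},\ \|g - g_0\| \le n^{-1/(2+\rho)}} \left| \int (g - g_0)\, d(P_n - P) \right| \ge t\, n^{-2/(2+\rho)} \right) \le c \exp \left( -\frac{t\, n^{\rho/(2+\rho)}}{c^2} \right)$$

and

$$\Pr \left( \sup_{g \in \mathcal{G},\ \|g - g_0\| > n^{-1/(2+\rho)}} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1-\rho/2}} \ge t \right) \le c \exp \left( -\frac{t}{c^2} \right),$$

where the norms $\|g - g_0\|$ are norms in $L_2(P)$.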



A.2 Proof of Theorem 2

Recall the definition of $\mathrm{PC}(\beta, m, M, L)$ from Section 4.2. We will need the following lemma.

Lemma 4 ([17], Lemma 5, p. 121). There is a RDP such that the cells intersecting $B(\psi^*)$ are
at depth $J$ and all the other cells are at depths no greater than $J$. Denote the smallest such
RDP by $\pi_J^*$. Then $\pi_J^*$ has at most $2^{2d} \beta\, 2^{(d-1)J}$ cells intersecting $B(\psi^*)$.

Let $\phi'$ denote the $L$-level piecewise constant function defined by

$$\phi'(x) = \sum_{i=0}^{L-1} \frac{P(R_i')}{Q(R_i')}\, \mathbf{1}_{R_i'}(x), \qquad (43)$$

where each member region $R_i'$ of the partition $\{R_i'\}$ is composed of a union of cells $S \in \pi_J^*$.
Furthermore, let $\{R_i'\}$ satisfy the condition that the cells $S \in \pi_J^*$ contained in $R_i'/B(\psi^*)$ are also
contained in $A_i^*$. More concisely, we write $S \subset R_i'/B(\psi^*) \Rightarrow S \subseteq A_i^*/B(\psi^*)$. In words, this last
condition means that the partitions $\{R_i'\}$ and $\{A_i^*\}$ coincide except possibly on the boundary
$B(\psi^*)$.

First, observe that $D(\psi^*) - D(\phi^*) \le D(\psi^*) - D(\phi')$ since the divergence between the
pmfs induced by $\phi'$ is necessarily less than or equal to that induced by the best-in-class
quantization rule $\phi^*$. (This inequality also follows from the Data Processing Theorem [H pp. 18-
22].)

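Next, upper bound the difference $D(\psi^*) - D(\phi')$ by the $L_1$-norm of $(\psi^* - \phi')$:

$$D(\psi^*) - D(\phi') = \int \log \frac{\psi^*}{\phi'}\, dP - \int (\psi^* - \phi')\, dQ$$

$$\le \int \left( \frac{\psi^*}{\phi'} - 1 \right) dP - \int (\psi^* - \phi')\, dQ$$

$$= \int \frac{1}{\phi'} (\psi^* - \phi')\, dP - \int (\psi^* - \phi')\, dQ$$

$$\le \frac{C}{m} \int |\psi^* - \phi'|\, dx + c \int |\psi^* - \phi'|\, dx$$

$$= \frac{C + cm}{m}\, \|\psi^* - \phi'\|_{L_1}, \qquad (44)$$

where the first inequality follows from the fact that $\log x \le x - 1$, for $x > 0$, and the second
inequality follows from the bounds on $\phi'$, $p$, and $q$.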
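Rewrite the $L_1$-norm as

$$\|\psi^* - \phi'\|_{L_1} = \int |\psi^*(x) - \phi'(x)|\, dx = \sum_{i=0}^{L-1} \int_{R_i'} |\psi^*(x) - \phi'(x)|\, dx$$

$$= \sum_{i=0}^{L-1} \left[ \sum_{S \subset R_i'/B(\psi^*)} \int_S |\psi^*(x) - \phi'(x)|\, dx + \sum_{S \subset R_i'(B(\psi^*))} \int_S |\psi^*(x) - \phi'(x)|\, dx \right]. \qquad (45)$$

Here, $S \subset R_i'/B(\psi^*)$ means all cells $S$ that are a subset of $R_i'$ which do not intersect the
boundary $B(\psi^*)$. Similarly, $S \subset R_i'(B(\psi^*))$ means all cells $S$ that are subsets of $R_i'$ which do
intersect $B(\psi^*)$.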

Consider the second summation within the brackets in (45). By the boundedness assumptions
on $\psi^*$ and $\phi'$, the integrand can be upper bounded by $M$. Therefore,

$$\sum_{S \subset R_i'(B(\psi^*))} \int_S |\psi^*(x) - \phi'(x)|\, dx \le M \sum_{S \subset R_i'(B(\psi^*))} \mathrm{Vol}(S) \le M\, 2^{2d} \beta\, 2^{-J}, \qquad (46)$$

where the second inequality follows from Lemma 4 and the fact that the volume of one cell $S$
is $2^{-dJ}$.
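The scaling behind this step can be illustrated numerically: if the number of depth-$J$ cells meeting the boundary grows like $2^{(d-1)J}$ while each cell has volume $2^{-dJ}$, the total volume of the boundary cells shrinks like $2^{-J}$. The sketch below counts such cells for $d = 2$ on $[0, 1]^2$, using a circle as a stand-in boundary; the circle, its center, and its radius are hypothetical choices for illustration and are not taken from the paper.

```python
def boundary_cells(J, center=(0.5, 0.5), radius=0.3):
    """Count depth-J dyadic cells of [0,1]^2 whose closure meets the circle."""
    n = 2 ** J
    h = 1.0 / n
    cx, cy = center
    count = 0
    for i in range(n):
        for j in range(n):
            x0, x1 = i * h, (i + 1) * h
            y0, y1 = j * h, (j + 1) * h
            # Nearest and farthest distances from the circle center to the cell.
            dx_min = max(x0 - cx, 0.0, cx - x1)
            dy_min = max(y0 - cy, 0.0, cy - y1)
            dx_max = max(abs(x0 - cx), abs(x1 - cx))
            dy_max = max(abs(y0 - cy), abs(y1 - cy))
            d_min = (dx_min ** 2 + dy_min ** 2) ** 0.5
            d_max = (dx_max ** 2 + dy_max ** 2) ** 0.5
            # The cell contains a point at distance exactly `radius` from the center.
            if d_min <= radius <= d_max:
                count += 1
    return count

def boundary_volume(J):
    """Total volume of the boundary cells at depth J (each cell has volume 4^-J)."""
    return boundary_cells(J) * 4.0 ** (-J)

for J in (5, 6, 7):
    print(J, boundary_cells(J), boundary_volume(J))
```

Each extra level of depth roughly doubles the number of boundary cells and roughly halves their total volume, which is the $2^{-J}$ factor appearing in the bound.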

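Now, consider the first summation within the brackets in (45). For all $S \subset R_i'$ (and in
particular for all $S \subset R_i'/B(\psi^*)$), $\phi'$ equals $P(R_i')/Q(R_i')$ (recall (43)). Likewise, by the definition
of $\phi'$, $\psi^*$ is also constant for all $S \subset R_i'/B(\psi^*)$. Therefore, using $Q(R_i') \ge c\, \mathrm{Vol}(R_i')$ in the last step, we have

$$\sum_{i=0}^{L-1} \sum_{S \subset R_i'/B(\psi^*)} \int_S |\psi^*(x) - \phi'(x)|\, dx = \sum_{i=0}^{L-1} \left| \frac{P(A_i^*)}{Q(A_i^*)} - \frac{P(R_i')}{Q(R_i')} \right| \mathrm{Vol}(R_i'/B(\psi^*))$$

$$= \sum_{i=0}^{L-1} \frac{|P(A_i^*) Q(R_i') - P(R_i') Q(A_i^*)|}{Q(A_i^*) Q(R_i')}\, \mathrm{Vol}(R_i'/B(\psi^*))$$

$$\le \frac{1}{c} \sum_{i=0}^{L-1} \frac{|P(A_i^*) Q(R_i') - P(R_i') Q(A_i^*)|}{Q(A_i^*)}. \qquad (47)$$

Using the inequalities

$$Q(R_i') \le Q(A_i^*) + \sum_{S \subset R_i'(B(A_i^*))} Q(S),$$

$$P(R_i') \ge P(A_i^*) - \sum_{S \subset R_i'(B(A_i^*))} P(S),$$

we upper bound each term in the summation in (47):

$$\frac{1}{Q(A_i^*)} |P(A_i^*) Q(R_i') - P(R_i') Q(A_i^*)| \le \frac{1}{Q(A_i^*)} \left| P(A_i^*) \sum_{S \subset R_i'(B(A_i^*))} Q(S) + Q(A_i^*) \sum_{S \subset R_i'(B(A_i^*))} P(S) \right|$$

$$\le \frac{1}{Q(A_i^*)} \left| P(A_i^*)\, C \beta' 2^{-J} + Q(A_i^*)\, C \beta' 2^{-J} \right|$$

$$= C \beta' 2^{-J}\, \frac{P(A_i^*) + Q(A_i^*)}{Q(A_i^*)}, \qquad (48)$$

where the second inequality follows from Lemma 4 with $\beta' = 2^{2d} \beta$.
Summarizing, we have

$$\sum_{i=0}^{L-1} \sum_{S \subset R_i'/B(\psi^*)} \int_S |\psi^*(x) - \phi'(x)|\, dx \le \frac{C}{c} \left( \frac{C}{c} + 1 \right) \beta' L\, 2^{-J}, \qquad (49)$$

where the last step follows from the assumed bounds on $p$ and $q$.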
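Finally, by combining (44), (45), (46), and (49), we conclude

$$\|\psi^* - \phi'\|_{L_1} \le \beta' L \left[ M + (C/c)(C/c + 1) \right] 2^{-J} \qquad (50)$$

and

$$D(\psi^*) - D(\phi') \le \frac{C + cm}{m}\, \beta' L \left[ M + (C/c)(C/c + 1) \right] 2^{-J}. \qquad (51)$$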
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